
7. LEAST SQUARES ESTIMATION

EXERCISE: Least-Squares Estimation and Uniqueness of Estimates

1. For $n$ real numbers $a_1, \ldots, a_n$, what value of $a$ minimizes the sum of squared distances from $a$ to each of the $a_i$: $\sum_{i=1}^n (a_i - a)^2$? (Prove it.)

2. Here are two datasets, given as $(x, y)$ pairs. For each dataset:

   - Sketch a scatterplot of the data.
   - What is the least-squares line $y_i = \beta_0 + \beta_1 x_i + \epsilon_i$? That is, what is the line that minimizes the residual sum of squares?
   - What is $\hat{Y}$? What is $\hat{\beta}$?

   Dataset A: $\{(1,1), (1,2), (1,3), (1,5)\}$.
   Dataset B: $\{(1,1), (-1,2), (1,3), (-1,5)\}$.

3. For a given dataset and linear model, what do you think is true about least squares estimates? Is $\hat{Y}$ always unique? Yes. Is $\hat{\beta}$ always unique? No.
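Part 3 can be checked numerically. Below is a minimal numpy sketch (np.linalg.lstsq returns one least-squares solution, the minimum-norm one when $X$ is rank-deficient; the code itself is illustrative):

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0, 5.0])
# Model y_i = beta_0 + beta_1 x_i: first column of X is the intercept.
XA = np.column_stack([np.ones(4), [1.0, 1.0, 1.0, 1.0]])    # Dataset A: rank 1
XB = np.column_stack([np.ones(4), [1.0, -1.0, 1.0, -1.0]])  # Dataset B: rank 2

for name, X in [("A", XA), ("B", XB)]:
    beta_hat, _, rank, _ = np.linalg.lstsq(X, y, rcond=None)
    # For A, any (b0, b1) with b0 + b1 = 2.75 gives the same fitted vector,
    # so beta_hat is not unique; the fitted vector Y_hat is unique in both cases.
    print(name, "rank:", rank, "beta_hat:", beta_hat, "Y_hat:", X @ beta_hat)
```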

7.1 Least Squares Estimators

Recall the linear model $Y = X\beta + \varepsilon$, i.e.

$$
\begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix}
=
\begin{pmatrix}
x_{10} & x_{11} & \cdots & x_{1,p-1} \\
x_{20} & x_{21} & \cdots & x_{2,p-1} \\
\vdots & & & \vdots \\
x_{n0} & x_{n1} & \cdots & x_{n,p-1}
\end{pmatrix}
\begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_{p-1} \end{pmatrix}
+
\begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}
$$

Definition: An estimate $\hat{\beta}$ is a least-squares estimate of $\beta$ if it minimizes the length $\|Y - X\beta\|$ over all $\beta$.

Note: least-squares is a mathematical criterion, not a statistical criterion.

Let $x_0, x_1, \ldots, x_{p-1}$ be the columns of $X$. Then

$$
X\beta = \begin{pmatrix} x_0 & x_1 & \cdots & x_{p-1} \end{pmatrix}
\begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_{p-1} \end{pmatrix}
= \beta_0 x_0 + \beta_1 x_1 + \cdots + \beta_{p-1} x_{p-1} \in R(X),
$$

the range (column space) of $X$.

Questions: Why do we say "a least-squares estimate" instead of "the least-squares estimate"? If there is more than one least-squares estimate, what is the geometric interpretation?
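A quick numerical illustration of $X\beta$ as a linear combination of the columns of $X$ (a sketch assuming numpy; the matrix and coefficients are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(5, 3))
beta = np.array([2.0, -1.0, 0.5])

# beta_0 x_0 + beta_1 x_1 + beta_2 x_2, built column by column.
combo = sum(beta[j] * X[:, j] for j in range(3))
print(np.allclose(X @ beta, combo))  # True: X beta lies in the column space of X
```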

A least-squares estimate can be found by solving the following minimization problem:

Minimize $\|Y - \theta\|$ over $\theta \in R(X)$.

7.2 Orthogonal Projection

Lemma 7.2.1: $Y$ can be uniquely decomposed as $Y = \hat{Y} + \hat{\varepsilon}$, where $\hat{Y} \in R(X)$ and $\hat{\varepsilon} \in [R(X)]^\perp$, the orthogonal complement of $R(X)$:

$$[R(X)]^\perp = \{a : X'a = 0\}.$$

Definition: $\hat{Y}$ is the orthogonal projection of $Y$ onto $R(X)$. It is also called the fitted vector or vector of fitted values.

[Figure: $Y$ decomposed into $\hat{Y} \in R(X)$ and the orthogonal residual $\hat{\varepsilon}$.]

Proof:

Existence: There must be at least one such decomposition because $R(X)$ and $[R(X)]^\perp$ together span $\mathbb{R}^n$.

Uniqueness: Suppose $Y = \hat{Y}_1 + \hat{\varepsilon}_1$ and $Y = \hat{Y}_2 + \hat{\varepsilon}_2$. Then $(\hat{Y}_1 - \hat{Y}_2) + (\hat{\varepsilon}_1 - \hat{\varepsilon}_2) = 0$. Taking the inner product of this vector with itself, we obtain

$$
0 = \big[(\hat{Y}_1 - \hat{Y}_2) + (\hat{\varepsilon}_1 - \hat{\varepsilon}_2)\big]'\big[(\hat{Y}_1 - \hat{Y}_2) + (\hat{\varepsilon}_1 - \hat{\varepsilon}_2)\big]
= \|\hat{Y}_1 - \hat{Y}_2\|^2 + \|\hat{\varepsilon}_1 - \hat{\varepsilon}_2\|^2 + 2(\hat{Y}_1 - \hat{Y}_2)'(\hat{\varepsilon}_1 - \hat{\varepsilon}_2)
= \|\hat{Y}_1 - \hat{Y}_2\|^2 + \|\hat{\varepsilon}_1 - \hat{\varepsilon}_2\|^2,
$$

where the cross term vanishes because $\hat{Y}_1 - \hat{Y}_2 \in R(X)$ and $\hat{\varepsilon}_1 - \hat{\varepsilon}_2 \in [R(X)]^\perp$. Hence $\hat{Y}_1 - \hat{Y}_2 = 0$ and $\hat{\varepsilon}_1 - \hat{\varepsilon}_2 = 0$.
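The decomposition is easy to verify numerically. A minimal sketch, assuming numpy and an arbitrary full-rank $X$ with random data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))   # full-rank model matrix (illustrative)
Y = rng.normal(size=10)

# Orthogonal projection of Y onto R(X): P = X (X'X)^{-1} X'.
P = X @ np.linalg.solve(X.T @ X, X.T)
Y_hat = P @ Y
eps_hat = Y - Y_hat

print(np.allclose(Y, Y_hat + eps_hat))   # Y = Y_hat + eps_hat
print(np.allclose(X.T @ eps_hat, 0.0))   # eps_hat in [R(X)]^perp: X' eps_hat = 0
```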

Lemma 7.2.2: The orthogonal projection solves the least-squares minimization problem.

Proof: For any $\theta \in R(X)$, $(Y - \hat{Y})'(\hat{Y} - \theta) = 0$. Therefore

$$
\|Y - \theta\|^2 = \|Y - \hat{Y} + \hat{Y} - \theta\|^2 = \|Y - \hat{Y}\|^2 + \|\hat{Y} - \theta\|^2,
$$

which is minimized by $\theta = \hat{Y}$.

[Figure: right triangle with vertices $Y$, $\hat{Y}$, and $\theta$; the leg $Y - \hat{Y}$ is orthogonal to $\hat{Y} - \theta$.]

We have just established that the vector in $R(X)$ that is closest to $Y$ ("closest" in the least-squares sense) is the projection of $Y$ onto $R(X)$.
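A numerical illustration of the Pythagorean identity in this proof (a sketch assuming numpy; each $\theta$ is drawn at random from $R(X)$):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 3))
Y = rng.normal(size=10)
Y_hat = X @ np.linalg.solve(X.T @ X, X.T @ Y)
rss = np.sum((Y - Y_hat) ** 2)

for _ in range(5):
    theta = X @ rng.normal(size=3)  # an arbitrary point of R(X)
    lhs = np.sum((Y - theta) ** 2)
    rhs = rss + np.sum((Y_hat - theta) ** 2)
    assert np.isclose(lhs, rhs) and lhs >= rss   # identity holds; Y_hat is closest
print("Pythagorean identity verified; the projection minimizes the distance.")
```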

7.3 Normal Equations

Since $Y - \hat{Y} \in [R(X)]^\perp$, we know that

$$X'(Y - \hat{Y}) = 0.$$

This implies that

$$X'Y = X'\hat{Y}.$$

Since $\hat{Y} \in R(X)$, we can write $\hat{Y} = X\hat{\beta}$. So we have

$$X'Y = X'X\hat{\beta}.$$

We have just proved:

Lemma 7.3.1: A least squares estimate of $\beta$, denoted $\hat{\beta}$, is a solution to the normal equations:

$$X'X\hat{\beta} = X'Y.$$

Note: An alternative derivation of the normal equations uses derivatives to find a minimum of $\|Y - X\beta\|^2$ (Seber & Lee, p. 38).
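In the full-rank case the normal equations can be solved directly. A minimal numpy sketch (np.linalg.lstsq is used only as an independent cross-check; the random data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(10, 3))
Y = rng.normal(size=10)

# Solve X'X beta = X'Y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
print(np.allclose(beta_hat, np.linalg.lstsq(X, Y, rcond=None)[0]))  # True
```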

7.4 Residual Vector

Definition: The residual vector is $\hat{\varepsilon} = Y - \hat{Y} = Y - X\hat{\beta}$.

Definition: The residual sum of squares is defined by

$$\mathrm{RSS} = \hat{\varepsilon}'\hat{\varepsilon} = \sum_{i=1}^n \hat{\varepsilon}_i^2 = (Y - X\hat{\beta})'(Y - X\hat{\beta}).$$

7.5 The Full Rank Case

If $\mathrm{rank}(X_{n \times p}) = p$, then $X$ has full rank (the largest possible, assuming $p \le n$). Then $\mathrm{rank}(X'X) = p$ (Seber & Lee, A2.4), so $(X'X)^{-1}$ exists. In this case the normal equations have the unique solution

$$\hat{\beta} = (X'X)^{-1}X'Y.$$

The orthogonal projection (fitted vector) is

$$\hat{Y} = X\hat{\beta} = X(X'X)^{-1}X'Y = PY, \quad \text{where } P = X(X'X)^{-1}X'.$$

Note: $P$ is sometimes called the "hat matrix" because $PY = \hat{Y}$. It is a projection matrix, and it projects $Y$ onto $R(X)$.

Lemma 7.5.1: Let $P = X(X'X)^{-1}X'$, where $X$ has full rank. Then

(i) $P$ and $I - P$ are projection matrices.
(ii) $\mathrm{rank}(I - P) = \mathrm{tr}(I - P) = n - p$.
(iii) $PX = X$.

Interpretation: $P$ is projection onto $R(X)$; $I - P$ is projection onto $[R(X)]^\perp$.

For the residual vector we have

$$\hat{\varepsilon} = Y - \hat{Y} = Y - PY = (I - P)Y \quad (\text{note: } \hat{\varepsilon} \in [R(X)]^\perp),$$

and for the residual sum of squares we can write

$$\mathrm{RSS} = \hat{\varepsilon}'\hat{\varepsilon} = Y'(I - P)Y.$$
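The three parts of Lemma 7.5.1 and the RSS identity can all be checked numerically. A sketch assuming numpy and a full-rank random $X$:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 10, 3
X = rng.normal(size=(n, p))
Y = rng.normal(size=n)

P = X @ np.linalg.solve(X.T @ X, X.T)  # hat matrix
I = np.eye(n)

print(np.allclose(P @ P, P), np.allclose(P, P.T))  # (i) P idempotent and symmetric
print(np.isclose(np.trace(I - P), n - p))          # (ii) tr(I - P) = n - p
print(np.allclose(P @ X, X))                       # (iii) PX = X

eps_hat = (I - P) @ Y
print(np.isclose(eps_hat @ eps_hat, Y @ (I - P) @ Y))  # RSS = Y'(I - P)Y
```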

7.6 The Less-Than-Full-Rank Case

Lemma: Let $\mathrm{rank}(X) = r < p$ and $P = X(X'X)^{-}X'$, where $(X'X)^{-}$ is a generalized inverse of $X'X$. Then

(i) $P$ and $I - P$ are projection matrices.
(ii) $\mathrm{rank}(I - P) = \mathrm{tr}(I - P) = n - r$.
(iii) $X'(I - P) = 0$.

Sketch of proof: There is a unique matrix $P$ such that $\hat{\theta} = PY$ (see Seber & Lee B1.2). One representation for $P$ is $P = X_1(X_1'X_1)^{-1}X_1'$, where $X_1$ consists of $r$ linearly independent columns of $X$.

(i) Show $P$ is idempotent and symmetric, and therefore a projection matrix:

$$P' = X_1(X_1'X_1)^{-1}X_1' = P, \qquad P^2 = P.$$

(ii) $\mathrm{rank}(I - P) = \mathrm{tr}(I - P)$ because $I - P$ is a projection matrix. But $\mathrm{tr}(I - P) = \mathrm{tr}(I) - \mathrm{tr}(P) = n - \mathrm{tr}(P)$, and

$$\mathrm{tr}(P) = \mathrm{tr}[X_1(X_1'X_1)^{-1}X_1'] = \mathrm{tr}[(X_1'X_1)^{-1}X_1'X_1] = \mathrm{tr}(I_{r \times r}) = r.$$

(iii) This is equivalent to $(I - P)X = 0$, or $PX = X$. This is clearly true, since $Px_j = x_j$ for every column $x_j$ of $X$, because $x_j \in R(X)$.
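A numerical sketch of the less-than-full-rank case (assuming numpy; np.linalg.pinv supplies one particular generalized inverse, the Moore-Penrose inverse, and the collinear design is constructed for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 10
x1 = rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, 2.0 * x1])  # third column = 2 * second: r = 2 < p = 3
Y = rng.normal(size=n)

G = np.linalg.pinv(X.T @ X)  # a generalized inverse of X'X
P = X @ G @ X.T

print(np.allclose(P @ P, P), np.allclose(P, P.T))    # (i)
print(np.isclose(np.trace(np.eye(n) - P), n - 2))    # (ii) n - r
print(np.allclose(X.T @ (np.eye(n) - P), 0.0))       # (iii)

# beta_hat is not unique, but Y_hat = PY is:
b1 = G @ X.T @ Y                       # one solution of the normal equations
b2 = b1 + np.array([0.0, 2.0, -1.0])   # another: (0, 2, -1) is in the null space of X
print(np.allclose(X @ b1, X @ b2), np.allclose(X @ b1, P @ Y))  # same fitted vector
```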

8. PROPERTIES OF LEAST SQUARES ESTIMATES

Basic Distributional Assumptions of the Linear Model:

1. The errors are unbiased: $E[\varepsilon] = 0$.
2. The errors are uncorrelated with common variance: $\mathrm{Cov}(\varepsilon) = \sigma^2 I$.
